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AN AUTOMATIC METHOD OF CLASSIFYING MOLECULES 

FIELD OF THE INVENTION 

The present invention relates to an automatic method and system for the 
identification of biological families of molecules, in particular proteins, and a method for 
creating an hierarchical organization within those families. 

BACKGROUND OF THE INVENTION 

In recent years we have been witnessing a constant flow of new biological data. 
Large scale sequencing projects throughout the world turn out new sequences and 
create new challenges for investigators. Many sequences that are added to the 
databases are unannotated and await analysis. Currently, 13 complete genomes (of 
yeast, E. coll other bacteria and several archae) are available. Between 35%-50% of 
their proteins have an unknown function (Penmisi 1997, Doolittle 1998). In the 
absence of structural data, analysis necessarily starts by investigating the sequence 
proper. Sequence analysis considers the basic properties of individual amino acids, 
as well as their combination. The most effective analyses compare the sequence 
under study with the whole database, in search for close relatives. Properties of a new 
protein sequence are extrapolated from those of its neighbors. Since the early 70's, 
algorithms were developed for the purpose of comparing protein sequences 
(Needleman & Wunsch 1970. Smith & Waterman 1981, Lipman & Pearson 1985, 
Altschul et al. 1990). 

It is generally accepted that two sequences with over 30% identity along much 
of the seequences, are very likely to have the same fold [Sander & Schneider 1991, 
Flores et al. 1993, Hilbert et al. 1993]. Proteins of the same fold usually have similar 
biological functions (with the exception of convergent evolution in which the same fold 
is shared by non homologous proteins). Nevertheless, one encounters many cases of 
high similarity both in fold and function, despite a low sequence similarity (Muzzin 
1993, Pearson 1997). Such instances are, unfortunately, often missed by current 
search methods. 
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Detecting homology may often help in determining the function of new proteins. 
By definition, homologous proteins have evolved from the same ancestor protein. The 
degree of conservation varies among protein families. However, homologous proteins 
almost always have the same fold (Pearson 1996). Homology is, by definition, a 
transitive relation: If A is homologous to B, and B is homologous to C, then A is 
homologous to C. This simple observation can be very effective in discovering 
homology. However, when applied simple-mindedly, this observation leads to many 
pitfalls. 

Though the common evolutionary origin of two proteins is almost never directly 
observed, we can deduce homology among proteins, with a high statistical confidence, 
given that the sequence similarity is significant This is particularly useful in the so 
called "twilight zone" (Doolittle 1992), where sequences are identical to, say. 10-25%. 
Transitivity can be used to detect related proteins, beyond the power of a direct 
search. 

Though transitivity is an attractive concept, some perils of transitivity should be 
taken into account as well. It should be clear that similarity is not transitive, and that it 
does not necessarily entail homology. Significant similarities can be used to infer 
homology, with a level of confidence that depends on the statistical significance (see 
Pearson 1996). Therefore similarity should be carefully used in attempting to deduce 
homology. The statistical significance level of the similarity should take into account 
the level of evolutionary divergence within the family, in order to deduce homology 
reliably. 

Multidomain proteins make the deduction of homology particularly difficult. If 
protein 1 contains domains A and B, protein 2 contains domains B and C, protein 3 
contains domains C and D, then should proteins 1 and 3 be considered homologous? 
This simple example indicates the inadequacy of single-linkage clustering for the 
purpose of identifying protein families within the sequence space. Expert biologists can 
distinguish significant from insignificant similarities. However, the sheer size of current 
databases rules out an exhaustive manual computation of homologies. Our goal was 
to develop an automatic method for classification of protein sequences based on 
sequence similarity, through the detection of groups of homologous proteins (clusters). 
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and high level structures (groups of related clusters) within the sequence space. Such 
organization would reveal relationships among protein families and yield deeper 
insights into the nature of newly discovered sequences. 



3 



wo 99/39174 



PCT/IL99/00057 



SUMMARY OF THE INVENTION 

It is an object of the present invention to offer a method for hierarchical 
organization of protein sequences in a database according to their biological functions, 
by using restricted transitivity for disclosing the similarity among proteins (splitting the 
space of all protein sequences into connected components or clusters wherein every two 
members are either directly or transitively related.) 

It is another object of the present invention to identify many biological families. By 
varying the threshold of statistical significance, finer sub- families that make up known 
families of proteins are being discovered. Likewise, this procedure exposes linkages 
between distinct protein families. Broadly speaking, protein families turn out to be 
connected in two distinct ways: (i) Through multi domain proteins, each of which is 
associated with a distinct protein family, or (ii) Through proteins much of whose 
sequence is shared by the two families. The latter may be considered as linkers or 
ancestor proteins. Consequently, many interesting relations between protein families are 
revealed and hierarchical organization within protein families suggest themselves. 

It is another object of the present invention to provide a new metric on the space 
of all protein sequences, which is more sensitive than the existing measures. Global self 
organization of all known protein sequences reveals inherent bilogical signatures. JMB 
268, 539-556 for analyzing new sequences (which are not on the SWiSSPROT 
database) and defining them according to the hierarchical organization. 
The present invention provides an automatic method of classifying molecules having 

similar biologic function comprising the steps of: a) creating a hierarchical organization of 

said molecules in a database, wherein groups of clusters are identified using local 

consideration resulting in related clusters; b) determining the position of a selected 

molecule based on the hierarchical organization of step a, whereby selected molecules 

of similar biologic function are classified. 

The present invention further provides an automatic method of classifying molecules 
having similar biologic function comprising the steps of: 
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a) creating a hierarchical organization of the molecules in a 
database comprising the steps of: 

i) calculating pairwise similarities between said molecules in said database 
by combining at least one standard measure of similarity, resulting in a first set of 
expectation values of similarity; 

ii) analyzing said first set of expectation values of similarity so as to obtain 
a second set of expectation values of similarity, wherein said molecules of the 
second set have a high degree of similarity; 

iii) merging said resulting molecules of step ii) so as to form clusters, 
wherein only molecules with an expectation value below a first restricted 
threshold are merged; 

iv) identifying groups of clusters using local consideration resulting in 
related clusters; 

v) determining the relationship between related clusters in said groups; 

vi) analyzing said groups of the related clusters of step v), thereby creating 
a hierarchical organization of said molecules; 

b) detemiining the position of a selected molecule based on the hierarchical 

organization of step a, comprising the steps of: 

i identifying pairwise similarities between said selected 
molecule to the molecules in said database by combining at least one 
standard measure of similarity, resulting in a third set of expectation 
values of similarity; 
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jj identifying geometric averaging of said selected molecule to 
each of said resulting clusters in said hierarchy, resulting in a forth set of 
expectation values of similarity; 

iii identifying related clusters from said hierarchy of step a, 
having a geometric averaging with said selected molecule below a 
second threshold, thereby classifying molecules having similar biologic 
function. 
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BRIEF DESCRIPTION OF THE DRAWINGS 

The invention is herein described, by way in example only, with reference to the 
accompanying drawings. In the drawings: 

Figures 1A-1C are graphic representations of the average distribution of e-values 
according to three main algorithms for sequence comparison, BLAST, PASTA and SW 
respectively, according to an embodiment of the invention; 

Figure 2 is a schematic tree of the hirarchical organization of small G-protein family; 
Figure 3 is a schematic tree of the hirarchical organization of ATP-binding transporter 
family; 

Figure 4 is a schematic tree of the super- family of motor proteins; 

Figure 5 is a schematic tree of the hirarchical organization of proteins that are involved in 

the biosynthesis of complex sugars; 

Figure 6 is a schematic alignment of two protein sequences sw:acsa-acexy and 
sw:ythl-rhoso; 

Figure 7 is a schematic tree of the hirarchical organization of methylases and the 
methyltransferases; 

Figure 8 is a schematic alignment of two protein sequences sw:hmt1-yeast and 
sw:y912,haein; 

Figure 9 is a table showing PROSITE families for which performance was below 50% 
true positives; 

Figure 10 is a table showing correlation of releases according to an embodiment of the 
invention; 

Figure 1 1 A is a table showing clusters possibly related to cluster 4 of the immunoglobulin 
V region, according to an embodiment of the invention; 
Figure 1 1B is a table showing clusters possibly related to cluster 5 of the 
immunoglobulins and major histocompatibility complex, according to an embodiment of 
the invention; 

Figure 12 is a schematic illustration tracing the formation of clusters; 

Figure 13 is a schematic illustartion showing the association of groups according to an 

embodiment of the invention; 
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Figure 14 is a schematic illustration of the clustering alsorithm according to an 
embodiment of the invention; 

Figure 15 is a table showing the largest clusters at the lowest confidence level 10"^; 
Figure 16 is a graphic representation of the correlation of BLAST e-values and SW 
e-values; 

Figure 17 is a schematic representation of the hierarchial organization of the small 
G-protein family; 

Figure 18 is a schematic map of the immunoglobulin superfamily according to an 
ambodimant of the invention; 

Figure 19 is a table showing the distribution of cluster sizes at various confidence levels; 
Figure 20 is a table showing performance evaluation; 

Figure 21 is a table showing the Largest clusters for which there was no 
corresponding family in PROSITE 13 or Pfam 1.0; and 

Figure 22 is a table showing clusters belonging to the immunoglobulin superfamily. 
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DETAILED DESCRIPTION OF THE INVENTION 

The present invention overcomes the drawbacks of the prior art by using 
transitivity in a controlled way (restricted transitivity) enabling organization of all the 
proteins in the database in a hierarchical organization. 

The present invention applied the following methods for comparing sequences 
against the databases: the Smith Waterman (SW), as disclosed in Smith, T. F, & 
Waterman, M. S. (1981), Comparison of Biosequenoes, Adv. inAppi Math. 2, 482-489, 
the FASTA, as disclosed in Lipman, D. J. & Pearson, W. R, (1985), Rapid and sensitive 
protein similarity, Science 227, 1435-1441 and the BLAST, as described in Altschul, S. 
F., Carrol. R. J. & Lipman. D. J. (1990), Basic local alignment search tool, J, Moi Biol. 
215, 403-410. These papers are incorporated by reference in their entirety herein. The 
results are analysis at different thresholds. The analysis starts at a very high and 
restricted threshold, where the proteins are organized in small clusters. When the 
threshold is increased in a step wise manner the clusters merge to give larger clusters, 
subfamilies and finally families. When a protein belongs to a certain cluster it influences 
and is influenced by the other members of the clusters. Thus, the question whether a 
certain cluster will merge with another cluster depends on the similarity of all of the 
protein that are in the cluster. There could be a situation in which cluster A contains a 
protein showing similarity to cluster B and therefore tends to merge with cluster B. but the 
other members of cluster A which do not show the same degree of similarity will prevent 
this merge. 

By performing this procedure at varying thresholds, in a stepwise manner a 
hierarchical organization of the connected components is obtained, and thus of all known 
proteins. As disclosed herein, the analysis of the invention starts from a very 
conservative classification, based on highly significant similarities, that consists of many 
classes. Subsequently, classes are merged to account for less significant similarities. 
Merging is performed via a two phase algorithm. First, the method identifies groups of 
possibly related clusters using local considerations, as will be described below. 
Subsequently, a global test is applied to identify nuclei of strong relationships within 
these groups of clusters, and clusters are merged accordingly. This process is repeated 



9 



wo 99/39174 



PCT/IL99/00057 



at varying levels of statistical significance, obtaining a hierarchical organization of all 
proteins. 

The resulting classification splits the protein space into well defined groups of 
proteins, which are closely correlated with natural biological families and superfamilies. 
Different indices of validity were tested to assess the quality of the classification. 
When compared with the protein families in PROSITE and Pfam databases, the 
classification has detected between 64.8% and 88.5% of the proteins in these families, 
as well as many new clusters which do not match any protein family in these 
databases. The hierarchical organization reveals finer subfamilies that make up 
known families of proteins as well as many novel relations between protein families. 

The method of the present invention is carried out by software or software 
means (data) executable on computing (data processing) means, such as a computer 
(PC) or similar data processor, microprocessors, embedded processors, 
microcomputers, microcontrollers, etc. These computing means are capable of 
performing processes typically algorithms, performing the method of the present 
invention. 

The procedure of the present invention was performed for the 
SWISSPROT database. As disclosed in Bairoch, A. & Boeckman, B. (1992) The 
SWISSPROT protein sequence data bank. {Nuci Acids Res. 20, 2019-2022). 

The procedure starts from creating a neighbors list for each sequence in 
the swissprot database in each of the following methods: the Smith Waterman (SW), 
the FASTA and the BLAST. A numerical normalization is applied first to all methods, 
so they are all on comparable scales. Then, only statistical significant similarities are 
maintained in these lists. These lists induce agraph to obtain a hierarchical 
organization of all of the proteins that in the database. The vertices of this graph are 
the protein sequences. Edges between the vertices are weighted with weights that 
reflect the distance or dissimilarity between the corresponding sequences, i.e. high 
similarity translates to a small weight (or distance). To compute the weight of the 
directed edge from A to B, one compares A against all sequences in the swissprot 
database, and obtains the distribution of its score. The weight is taken as the 
expectation value as disclosed in Atschul, S. F., Boguski, M. S., Gish, W. G. & 



10 



wo 99/39174 



PCT/1L99/00057 



Wooton, J. C. (1994) Issues in searching molecular sequence databases. Nature 
Genetics 6, 119-129 of the similarity score between A and B, based on this 
distribution. This is a statistical estimate for the number of occurrences of the 
appropriate score at a random setup, assuming the existing amino acid composition. 
When the similarity score is statistically insignificant, the corresponding edge is 
discarded. In other words, an edge among sequence A and B indicates that the 
corresponding proteins are likely to be related. Finally, the weight of an edge is defined 
as the minimum associated to it by any of the three methods, to capture the apparently 
strongest relation. When all edges of weight are eliminated below a certain 
significance threshold, the graph splits into connected components. These 
automatically induced sets of proteins that are closely correlated with natural biological 
families. By performing this procedure at varying thresholds, in a stepwise manner a 
hierarchical organization of the connected components is being obtained, and thus of 
all known proteins. 

This graph encodes much of the fundamental properties of the sequence 
space. It reflects the idea that interesting homologies among proteins can be deduced 
by transitivity. The transitive closure of the similarity relation among the proteins shown 
by the graph, splits the space of all protein sequences into connected components or 
clusters. These are proper subsets of the whole database wherein every two 
members are either directly or transitively related. These sets are maximal in this 
respect and cannot be expanded. Thus they offer a self-organized classification of all 
protein sequences in the database. These connected components can be expected to 
correlate with known biological families. The results suggest that these connected 
components are indeed very informative and robust. 

An attention should be drawn that this is a directed graph, so it is not 
necessarily symmetric. Specifically, it may (and does) happen that there is an edge 
from protein A to protein B, but none in the reverse direction. Furthermore, even if 
both edges exist, their weights may differ. Therefore, our notion of a component is 
that of a strong connected component. The partition into strongly connected 
components is thus more refined then the partition into connected components. This 
analysis can be performed at different thresholds, or confidence levels, to obtain an 
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hierarchical organization. Several connected components of a given threshold 
may fuse together at a more permissive threshold. The analysis starts at the 
10'"'°° threshold. Subsequent runs are carried out for 10"®^,10*®*^, ...10"^ = 1. 
Almost all of the clusters that were found are meaningful. Some correspond to 
well known families, but many others correspond to less studied families. There 
are clusters that consist exclusively of unknown proteins or hypothetical proteins. 
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Table 1 .Distribution of clusters by their size at each confidence level. Table 1 shows 
the distribution of clusters by their size at the different confidence levels. At each level, 
the universe of all proteins splits into connected components (clusters). These 
clusters become larger and coarser with the decrease of confidence levels. 
Consequently, the number of isolated proteins (clusters of size 1) decrease. As 
confidence level decrease to 10'^ There is a sharp decline in the number of midsize 
clusters. Chance similarities tend to blur the picture, and cause an "avalanche", where 
(possibly unrelated) many families are joined to few giant clusters. 

Reference is now made to Figs. 1A-1C. These figures are based on the 
following main algorithms for measuring similarity between protein sequences : the 
BLAST, the FASTA and the SW respectively. These methods are in daily use by 
biologists, for comparing sequences against the databases. Though SW tends to give 
the best results, it is not uncommon that FASTA or BLAST are more informative as 
disclosed in Pearson, W. R. (1995). Comparison of methods for searching protein 
sequence databases. Protein Science 4, 1145-1160 therefore the results of the three 
systems are being incorporated in order to achieve maximum sensitivity. For example 
SW is being used with the BLOSUM50 scoring matrix while FASTA and BLAST are 
being used with the BLOSUM62 scoring matrix in order to achieve better identification 
of remote homologous. Another example is when searches are strongly biased 
because the amino acid composition of the query sequence differs markedly from the 
overall average composition. A case in point is the effects of low complexity segments 
within the sequence as disclosed in Altschul, S. F., Boguski, M. S., Gish, W. G.& 
Wooton, J. C. (1994). Issues in searching molecular sequence databases. Nature 
Genetics 6, 119-129, Therefore, the results of BLAST (which uses the SEG program) 
as disclosed in Wootton, J. C. & Federhen, S. (1993). Statistics of local complexity in 
amino acid sequences and sequence databases. Comput Chem. 17, 149-163 in 
amino acid are being used for excluding low complexity segments from the query 
sequence. 

In order to compare the scores between the three methods one may pick any 
protein, carry out an exhaustive comparison against the whole database and consider 
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the highest score in each of the methods, then plot these values and compare two 
methods at a time. These scores show a very strong linear relation in log-log scale, 
therefore introducing a usually small multiplicative factor which per each protein and 
per method would scale the three methods to a single reference line. 

The differences between FASTA and SW are mostly due to the different scoring 
matrices that are being used and can be corrected by multiplying the original score by 
the relative entropy of the two matrices as described in Altschul, S. F. (1991) Amino 
acid substitution matrices from an information theoretic perspective. JMB 219, 
555-565. The difference between SW and BLAST may be due to approximations in 
estimating the parameters X and K as described in S. Kariin & S. F. Adtsdiul. (1990), 
Methods for assessing the statistical significance of molecular sequence features by 
using general scoring schemes. PNAS 87, 2264-2268. The underlying assumption in 
calculating these parameters is that the amino acid composition of the query sequence 
is close to the overall distribution. This assumption often fails, e.g. for low complexity 
segments. Moreover, these parameters are based on first order statistics of the 
sequence, the scoring matrix and the database. The corrections that are required to 
match SW and BLAST may be due to inaccurate approximations of the estimated 
parameters, or due to higher order statistics of the sequence. 

It is very difficult to set a clear dividing line between true homologies and 
chance similarities. Expectation values below 10'^ can be safely considered significant 
and those above 10 reflect almost pure chance similarity. However, the range within is 
difficult to characterize, and truly related proteins may have expectation values around 
1. An overly strict threshold will miss important similarities within the twilight zone, 
whereas an excessively liberal criterion will create many false connections. The exact 
threshold for each method was set to best discern among related and unrelated 
proteins. In the present invention it is based on the overall distribution of distances 
over the entire protein space, as given by each of the three methods. 

This is illustrated in Figs. 1A - 1C, which show the average distribution of 
expectation values over the entire SWISSPROT database, for BLAST, FASTA and 
SW respectively. The graphs in Figs. 1A-1C naturally suggest a threshold for each 
method. The distribution drawn a log-log scale is nearly linear, at low expectation 
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values, but starts a rapid increase at a certain value. This value is set to be the 
threshold. The thresholds for SW, FASTA and BLAST are set at 0.1, 0.1 and 10'^ 
respectively. An edge from vertex A to vertex B is maintained only if a significant 
score is obtained on comparing the corresponding proteins. Namely, if either SW or 
FASTA yield an expectation value < 0.1 or BLAST'S expectation value is < 10"^. 
A major difference between BLAST and SW/FASTA is that BLAST charges no gap 
penalties. Consequently, BLAST tends to overestimate the statistical significance of 
alignments. The method of the present invention counters this behavior of BLAST by 
selecting the edges asymmetrically as shown in figs, 1A-1C. While this property may 
help BLAST reveal significant similarities that the other methods miss (as described in 
Pearson, W. R. (1995). Comparison of methods for searching protein sequence 
databases. Protein Science 4, 145-1160) it could also bring to highly fragmentary 
alignments that cannot be considered biologically meaningful. Therefore, the method 
of the present invention ignores those BLAST scores that come from a large number 
of HSPs (high scoring pairs), where the MSP (maximal segment pair) is insignificant 
Finally, even if the comparisons between proteins A and B fail to satisfy the previous 
criteria, the edge from A to 6 is maintained when all three methods yield an 
expectation value < 1 . 

Reference is now made to Figs. 2 and 3. These figures are examples of 
hierarchical organization within known families. This organization is based on the 
information extracted while moving across the different levels of the tree which means 
scanning the hierarchy over all levels. Fig. 2 describes the small G-protein/Ras super 
family. This family of proteins is composed of the sub-families: ras, rab, ran, rho, ral, 
and smaller sub-families. Fig. 2 depicts the relations within this family, based on the 
hierarchical organization obtained by the method of the present invention. A total of 
366 proteins, all belonging to the small G-protein super-family, are presented. Small 
clusters, which correspond to subfamilies, are formed at the high levels of 
confidences, and fuse to larger clusters, when the threshold is raised. At the low level 
of confidence of 10"^*^, this family merges with the ADP ribosylation factors family and 
guanine nucleotide-binding proteins, all of which are GTP-binding proteins, to form one 
cluster. 
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Fig. 3 describes the ATP-binding transporters family. Transporters are 
membranous elements which provide the mechanism by which components cross the 
lipid bilayer within and from the environment of cell compartments. The large variety of 
components, environmental conditions and organisms make this super-family very 
complex and rich. This diverged family is another example for which an hierarchical 
organization is proposed. The sub-classification distinguishes between amino-acids 
transporters, oligopeptide transporters, metal transporters, multi drug resistance 
proteins, and many more subgroups. Out of 296 proteins presented here, 75 are 
hypothetical transporters which can be classified based on this organization, according 
to their position in the tree. 

Reference is now made to Figs. 4,5 and 7. These figures demonstrate how 
transitivity can be used to verify the relation between different, but functionally related, 
protein families and how the connections are created when moving from one 
confidence level to the next level. 

Fig. 4 describes the hierarchy of the super-family of motor proteins. Each circle 
indicated by the numeral 100, represents a connected component at threshold 10"^^. 
Circles' radii are proportionate to the component's size. The component's size appears 
next to the corresponding circle. The drawn edges appeared upon lowering the 
threshold to 10'^°. The letters A-D indicate the order of transitivity indicated by the 
vertical line 101. Each cluster is referred to by its order of transitivity, A-D, and its 
position from left to right. Clusters which consist solely of hypothetical proteins are 
indicated by a second superimposed dashed circle 102 . 

Fig. 4 demonstrates how, in some cases, the connection between functionally 
related proteins is revealed only through the connection with hypothetical proteins. 
One such example is the connection between the myosins and the kinesins. The 
isolated sets (at the level of 10'^^) of kinesin and kinesin-like proteins (sets C1, D1, 
D2), myosins (set A1), axoneme-associated proteins (set C2) , and trichohyalin (set 
B2), were grouped together, in some cases via connections with hypothetical proteins 
(sets B1 , C3), to form a super family of motor proteins (at the level of 10"^°) with a total 
of 120 proteins. All proteins share an elongated structure, and energy dependent 
motor activity and are expressed throughout the evolutionary tree. They do vary in 
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their directionality of action, tissue specificity and their highly diverged biological 
contexts. 

Similarly Fig. 5 describes the hierarchy of proteins involved in the biosynthesis 
of complex sugars, where each circle represent a connected component at threshold 
10"^^, and the drawn edges appeared upon raising the threshold to 10"^. Indeed, a 
relation between those families is established thanks to the raising of the confidence 
level from 10'^^ to 10*^. As in the previous example, the connection is established via 
hypothetical proteins, based on weak alignments. However, the basic biological 
feature which characterizes all these proteins, makes the connections inevitable. 

Fig. 6 describes the alignment of the sequences sw:acsa-acexy and 
sw:yth1-rhosom, (which correspond to sets A1 and B2 respectively in Fig, 5). Fig. 6 
shows 24% identity and this demonstrates the weak alignment achieved when 
comparing protein sequences, where there are similar biological features connecting 
the two sequences. In contrast, Fig. 5 shows the connection of the sequences 
classified to sets A1 and B2. Thus if s possible to identify that these two sequences 
belong to the same biological family. This connection can only be achieved by the 
method of the present invention as was shown in Fig. 5, 

Similarly Fig. 7 describes the hierarchy of proteins involved in the methylase, 
methyltransf erase family where each circle represents a connected component at 
threshold 10"'*^, and the drawn edges appeared upon raising the threshold to 10'^. This 
family is another example for the importance of hypothetical proteins as linkage 
proteins. Through such links a natural connection between related biological families is 
established, as demonstrated for methylasels and methy [transferases. 

A total of 80 proteins in 28 isolated sets (at the level of 10"^°) were connected to 
one cluster at the level of 10*^. The Y-axes of the graph represents 11 orders of 
transitivity (labeled A-K). At the bottom end (A1, B1, B2,C1) as well as at the top end 
(J2) methylases and methyltransieraser are common. Many hypothetical proteins are 
scattered within these clusters. Some clusters contain exclusively hypothetical proteins 
(E2, H1, 12), The connection between the two ends of this graph is made through such 
clusters (H1,I2). 
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Fig. 8 describes the alignment of sequences sw.hnnti -yeast and sw:y912-haein 
that correspond to sets G1 and H1 respectively, in Fig. 7. Fig. 8 shows 30% identity 
which demonstrates the weak alignment achieved when comparing protein 
sequences, where there are similar biological features connecting the two sequences. 
In contrast Fig. 7 shows the connection of the sequences classified to sets G1 and H1. 
The connections are based on very sparse pairwise alignments as compared with Fig. 
8, and thus raise a reasonable doubt on the biological significance. Yet, proteins at the 
two ends of the graph of Fig. 7 exhibit close biological function, therefore verify the 
validity of these connections. 

EXPERIMENTAL DETAILS SECTION 

Automatic classification of protein sequences, a 
hierarchy of protein families, and local maps of the protein space 

Motif and domain based analyses: Most of these studies yielded databases of protein 
motifs and domains, which have become an important tool in the analysis of newly 
discovered protein sequences. Among these are PROSITE (Bairoch et al. 1997), 
PRINTS (Attwood et al. 19981, Blocks, HenikoffL& Henikoff 1991), Pfam 
(Sonnbammer et al. 1997), ProDom (Sonnhammer & Kahn 1994), and Domo (Gracy & 
Argos 1998), The manually defined patterns in PROSITE have served as an excellent 
seed for several such studies. 

There are several aspects in which these studies differ from each other. Some 
are based on manual or serni-manual procedures (e.g,. PROSITE, PRINTS), others 
are semi-automatic (Pfam) and the rest are fully automatic (e.g. ProDom, Blocks, 
Domo). Some focus on short motifs (PROSITE, PRINTS, Blocks) while others seek 
whole domains and try to infer domain boundaries (Pfam, ProDom, Domo). Most 
databases also give the domain/motif structure of proteins. Two databases make use 
of transitivity to enhance sensitivity. Prodom applies the transitive closure of high 
scoring segments pairs obtained by BLAST (only when the common segments 
overlap, above a minimum overlap parameter). In the Pfam database, the construction 
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of new families starts form a model (HMM) derived from multiple alignment of related 
proteins, which is being improved iteratively by searching for more related sequences 
in the database. These are iteratively incorporated into the model, till convergence. In 
each iteration the resulting alignment is traced manually, to avoid misalignments. 

Protein based analysis: These studies can be divided into two categories. The first 
category includes methods that employ alternative representations of protein 
sequences, e.g. their di peptide composition (van Heel et al. 1991, Ferran et al. 1994) 
or combination of compositional properties and other physical/chemical properties 
(Hobohm and Sander 1995). The alternative representations induce measures of 
similarity/dissimilarity between complete protein sequences. These measures can in 
turn, be used (van Heel et al. 1, Ferran et al. 1994) to classify the sequences into 
(usually a fixed number of) clusters. The second category includes methods that draw 
directly on pairwise comparison, e.g. (Gonnet et al, 1992, Watanabe & Otsuka 1995, 
Koonin et al, 1996, Harris et al. 1992, Tatusov et al. 1997, Krause & Vingron 1998). All 
these works induce a clustering of the input database, based on the transitive closure 
of similarity scores (i.e. single linkage clustering). Among these works, two (Harris et 
al. 1992, Tatusov et al. 1997) have addressed the problem of multi-domain proteins. 
Harris et al. allow groups to merge only if they share k overlapping regions. However, 
they concluded that k=1 is the best choice for maximum accuracy. Thus their 
clustering procedure essentially remained a single-linkage clustering (in multi-trait 
proteins, regions are classified to multiple classes). In the second study (Tatusov et 
al. 1997), clusters are created starting from triangles of homologous proteins from 
different lineages. Triangles which share an edge are merged (this requirement 
reduces the probability that unrelated clusters merge). An additional (manual) step is 
carried to split clusters which are incorrectly merged due to multi-domain proteins. 

Conceptual difference: The important role of motifs and domains in defining protein's 
functionality is unquestionable. Detecting a known motif within a new protein 
sequence can help reveal its function and lead to the correct assignment of the new 
sequence to an existing protein family. Indeed, domain-based studies have added 
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much to our knowledge. However, in many cases, characterizing a new protein only 
by its domain content, seems insufficient. This happens, for example, when no known 
domains are apparent in the new protein. In some instances, only few related 
sequences are available, too few to define a reliable prototype signature or a profile of 
the common domains. Therefore, a proper analysis of a new protein sequence should 
incorporate comparisons against domain based databases, as well as sequence 
databases. In this view, an analysis which identifies groups of related proteins in 
databases of protein sequences is valuable. It may amplify the outcomes of an 
exhaustive pairwise comparison. When close hits are already grouped together based 
on their mutual similarity, this may highlight a similarity with a group which could 
othenvise be missed by a simple manual scanning. Moreover, if several groups are 
found related to the query sequence, this may indicate the existence of several distinct 
functional/structural domains. If all groups share the same region of similarity, this 
adds insights about the relations between the different groups. In some cases, this 
may suggest that the groups belong to the same family or superfamily. Some of this 
information, but not all of it, can also be inferred from comparisons wifh domain 
databases. The domain-based databases usually offer a lot of detailed information 
about domains and the domain structure of proteins, through multiple alignments, and 
schematic representations of proteins. This information is of great value for biologists. 
Moreover, unlike domain based methods, protein-based methods can be applied to all 
protein sequences and are more easily automated. 

The present invention belongs to the second category. It draws on pairwise 
similarities and looks for strongly connected sets of proteins. It applies a moderate 
version of transitive closure, in an attempt to eliminate chance similarities and avoid 
indirect multiple-domain-based connections. 

When begining with a very strict, high-resolution classification that employs only 
connections of very high statistical significance, clusters are merged to fonn bigger 
and more diverse clusters. The algorithm operates hierarchically, each step 
considering weaker connections in addition to the connections that are already 
accounted for, A statistical test is applied in order to identify and eliminate 
suspicious/problematic connections as well as possibly false connections between 
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unrelated proteins. The output is thus an hierarchical organization of all protein 
sequences. The algorithm incorporates no further biological information and uses only 
the similarity scores that are provided by standard methods. 

The method described here applies to the set of all SWISSPROT (Bairoch & 
Boeckman 1992) sequences, and yields an exhaustive classification. This approach 
leads to the definition of a new pseudo-rnetric on the space of all protein sequences. 
In most cases, this emerging metric turns to be more sensitive than the existing 
measures on the basis of which it was derived. Such metrics are necessary in the 
quest of a global self organization of all protein sequences, as discussed in (Linial et 
al. 1997). 

Methods 

This section contains a description of the computational procedure of the present 
invention. The procedure was carried out on the SWISSPROT database (Bairoch & 
Boeckman 1992) release 33, with a total of 52205 proteins. 

Defining the graph 

The space of all protein sequences is represented as a directed graph, whose vertices 
are the protein sequences. Edges between the vertices are weighted with weights that 
reflect the dissimilarity between the corresponding sequences, i.e. high similarity 
translates to a small weight. To compute the weight of the directed edge from A to B, 
one compares A against all sequences in the SWISSPROT database, and obtains a 
distribution of scores. The weight is taken as the expectation value (Altschul et ah 
1994) of the similarity score between A and B, based on this distribution. This is an 
estimate for the number of occurrences that the appropriate score could have been 
obtained by chance, i.e. when compared with random sequences drawn from the 
same background distribution (usually defined as the distribution of amino acids 
overall the database). A low expectation value reflects a strong connection, of high 
statistical significance, and a high expectation value entails an insignificant, weak 
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connection (expectation value decreases as the length of the similar region decreases, 
and self similarity of short proteins is less significant then self similarity of long 
proteins. Therefore, when distance function between protein sequences is sought, the 
expectation values may need to be corrected for the query length). Not all edges are 
retained in the graph. Edges of statistically insignificant similarity scores get discarded 
(details below). In other words, in the final graph, an edge between sequences A and 
B indicates that the corresponding proteins are likely to be related. 
This graph has been constructed, using the common algorithms for protein sequence 
comparison; Smith Waterman dynamic programming method (SW) (Smith & 
Waterman 1981), FASTA (Lipman & Pearson 1985) and BLAST (Altschul et al. 1990). 
The SW algorithm was run with the BLOSUM62 matrix (Henikoff & Henikoff 1992) and 
gap penalties of -12,-2 using the Bioccelerator hardware (Compugen) or the *ssearch' 
program which is part of the FASTA 2 package by Bill Pearson. FASTA was run using 
the 'fasta* program with the BLOSUM50 matrix (Henikoff & Henikoff 1992) and gap 
penalties -14,-2 (the default setting). Both 'ssearch' and 'fasta' calculate expectation 
values based on empirically derived distribution of scores (Pearson 1998) (the 
Bioccelerator apply the same procedure for assessing the significance of results, as in 
ssearch). The BLAST algorithm was run with the BLOSUM62 matrix using the blastp 
3 program available from the NCBI ftp site. The program reports similarity scores 
along with the probability that the scores could have occurred by chance. All these 
methods are in daily use by biologists, for comparing sequences against the 
databases. Though SW tends to give the best results on average, it is not uncommon 
that FASTA or BLAST are more informative, especially when combined with different 
scoring matrices (Pearson 1 995). Therefore all three methods were incorporated into 
the graph, to achieve maximum sensitivity. 

The following sections contain a detailed description of the procedure of assigning 
weights to edges. The procedure starts by creating a list of neighbors for each 
sequence, based on all the three methods. In order to place all three methods on 
comparable numerical scales, a numerical normalization is applied first to all methods. 
Then, only statistically significant similarities are maintained in these lists. Finally, the 
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weight of an edge is defined as the minimum associated to it by any of the three 
methods, to capture the apparently strongest relation. 

Placing all methods on a common numerical scale 

It is relatively easy to compare between scores that a particular method assigns to 
different comparisons. However, how does one compare between scores that are 
assigned by different methods? The following calculation was performed: Pick any 
protein, carry out an exhaustive comparison against the whole database and consider 
the highest scores in each of the methods. Now plot these values and compare two 
methods at a time. These Scores show a remarkably strong linear relation in log-log 
scale (Fig. 16), therefore by introducing a (usually small) correction factor, per each 
protein and per method, the three methods get scaled to a single reference line. 

Fig, 16 is a graphic representation of the correlation of BLAST e-values and SW 
e-values. (a) BLAST e-values of neighboring sequences of 1431 Jyces (P42651) vs. the 
SW e-values of the same neighbors. The graph is plotted in log-log scale. Note the 
strong linear correlation between the scores assigned by the two methods, where the 
correlation coefficient is 1.16, i.e. e-value BLAST= (evaluesw)^ ^^. The fitness of the linear 
line is even better when the line is of the form a * x + b. For 1431Jyces the best fit 
achieved by l1*x-5.5, i.e, evalueBLAST =: 10"^^ * (evaluesw)^^ (b) BLAST e-values of 
neighboring sequences of laOS-human (P04439) vs. the SW e-values of the same 
neighbors. Note that the correlation coefficient is 1 .3 in this case (the improvement of the 
fitness by introducing the 'b' term in the linear fit is marginal in this case). 

Defining the list of neighbors 

It is, of course, very difficult to set a clear dividing line between true homology and 
chance similarity. Expectation values below 10'^ can be safely considered significant 
and those above 10 reflect almost pure chance similarities. However, the midrange is 
difficult to characterize, and truly related proteins may have expectation values around 
1. An overly strict threshold will miss important similarities within the twilight zone. 
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whereas an excessively liberal criterion will create many false connections. The exact 
threshold for each pairwise comparison method was set to best discriminate among 
related and unrelated proteins. The choice is based on the overall distribution of 
expectation values over the entire protein space, as given by each of the three 
methods. 

This is illustrated in Figs. 1A-1C, which shows the distribution of expectation values 
over the entire SWISSPROT database, for SW, FASTA, and BLAST, based on the 
neighbors lists of all protein sequences in the SWISSPROT database. The distribution 
may be thought as the average distribution of expectation values, for a 'typical' protein 
sequence as a query. The graphs in Figs. 1A -1C naturally suggest a threshold for 
each method. The distribution drawn in a tog-log scale is nearly linear at low 
expectation values, but starts a rapid increase at a certain value. The steep slope 
indicates a rapid growth in the number of sequences that are unrelated to the query 
sequence (high expectation values). The distribution may differ from one sequence to 
another, and accordingly the threshold may change. However, the number of 
neighboring sequences of a single protein is usually not enough to deduce a reliable 
threshold. Only when the distributions are averaged then a general threshold is 
obtained, which can be used a s a guideline. 

In this view, the threshold was set at the value where the slope rapidly changes. The 
thresholds for SW, FASTA and BLAST are set at 0.1, 0.1 and 10'^ respectively. An 
edge from vertex A to vertex B is maintained only if a significant score is obtained on 
comparing the corresponding proteins. Namely, if either SW or FASTA yield an 
expectation value < 0.1 or BLAST'S expectation value is < 10"^. 

While the self-normalized statistical estimates of FASTA and SW (Pearson 1998) are 
quite reliable (see also Brenner et al. 1998), the statistical estimates of BLAST may be 
effected by the amino acid composition of the query sequence, and an unusual 
composition (e.g. low complexity segments within the sequence) may strongly bias the 
results of a search (Altschul et al, 1994). Therefore, the results of BLAST were 
consulted following a filtering of the query sequence, to exclude low complexity 
segments, using the SEG program (Wootton & Federhen 1993). Consequently, 
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filtering may significantly reduce the number of high scoring hits reported by BLAST. 
However, if only sequences that pass the filter are acknowledged, many relations of 
biological significance may be missed. Instead, a more stringent threshold is set in 
this case for BLAST at 10'^. 

A major difference between BLAST and SW/FASTA is that BLAST does not 
count/penalize for gaps (The new version of BLAST does account for gaps in the 
alignment. This heuristic is a very good approximation to the SW algorithm, whose 
main advantage is in its speed. It is, however, inferior to the rigorous SW algorithm. 
In contrast, the old version of BLAST occasionally detects similarities that are missed 
by SW (e.g. for the Glucagon precursor family, and the H -transporting ATP synthase 
[Pearson, 1995]).. BLAST defines the similarity based on one or more high-scoring 
segment, pairs (ungapped local alignments), and the significance is assessed by 
applying Poisson or sum statistics (Altschul et al. 1994). Consequently, since gaps 
are ignored, BLAST tends to overestimate the statistical significance of fragmented 
alignments, and this behavior of BLAST is countered by the above asymmetry in 
selecting the edges. While this property may help BLAST reveal significant similarities 
that the other methods miss (e.g. Pearson 1995), it is cautious to be aware of highly 
fragmentary alignments that cannot be considered biologically meaningful. Therefore, 
those BLAST scores that come from a large number of HSPs (high scoring pairs), 
when the MSP (maximal segment pair) is insignificant are ignored. 
Finally, even if the comparisons between proteins A and B fail to satisfy the previous 
criteria, the edge from A to S is maintained when all three methods yield an 
expectation value < 1 . 

This procedure is designed/helps to screen most of the chance similarities in the 
neighbors list of each protein sequence. Unfortunately, chance similarities may 
occasionally pass the set criteria. A major goal of the algorithm that is described next 
is to detect such similarities and eliminate them. 

Exploring the connectivity 
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In exploring the graph clusters of related sequences which hopefully have a 
characteristic biological function are sought. 

There are two major obstacles which should be considered: i) Multi domain proteins 
can create undesired connection among unrelated groups; ii) Overestimates of the 
statistical significance of similarity scores may bias decisions. Naturally, chance 
similarities become more abundant as significance levels decrease. 
Therefore, transitivity should be applied restrictively. By analogy, if transitivity is to be 
viewed as a force that attracts sequences, then it should be countered by some 
"rejecting forces" so that unrelated clusters be kept apart and prevent a collapse in the 
protein space (Though at the level of 10"^°^ this does not occur). 

The approach 

The approach is to begin by eliminating all edges of weight below a certain, very high, 
significance threshold (i.e. low expectation value). This operation splits our graph to 
many small components of strong connectivity. In biological terms, we split the set of 
all proteins into numerous small groups of closely related proteins, which correspond 
to highly conserved sub families. 

To proceed from this basic highly restrictive classification, the threshold is raised, in a 
stepwise manner, and more relaxed statistically significant similarities are taken into 
account. In so doing, several clusters of a given threshold may merge at a more 
permissive threshold. However, this process is closely monitored and a merge is 
allowed only when strong statistical evidence is found for a true connection among the 
proteins in the resulting set. 
Basic classification 

If all edges of weight below a certain significance threshold are eliminated, the 
transitive closure of the similarity relation among proteins splits the space of all protein 
sequences into connected components or clusters. These are proper subsets of the 
whole database wherein every two members are either directly or transitively related. 
These sets are maximal in this respect and cannot be expanded. Thus they offer a 
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self-organized classification of all protein sequences in the database. The threshold is 
set at the very stringent significance level of 10'^°°. Similarities which are reported as 
significant above the level of 10'^°° are very conserved and stretch along at least 150 
amino acids. Thus, neither chance similarities, nor connections based on a chain of 
distinct common domains in multi-domain proteins occur at this level . The resulting 
connected components can be safely expected to correlate with known highly 
conserved biological subfamilies. 

Note that this is a directed graph, and hence is not necessarily symmetric. 
Specifically, it may happen that there is an edge from protein A to protein B, but none 
in the reverse direction (a directed graph is strongly connected if for every two vertices 
there is a directed path from x to y as well as from y to x.) Furthermore, even if both 
edges exist, their weights usually differ. Therefore, the notion of a component is that 
of a strongly connected component. The partition into strongly connected components 
is clearly more refined that the partition into connected components. 

The clustering algorithm 

The procedure is recursive. That is, given the classification at threshold T, a method 
for deriving the classification at the next more permissive level should be given, that is 
10^ - T. The start is from the basic classification at 10'^°^. 

Figure 14 describes the clustering algorithm. Phase 1) Identify pairs of clusters that 
are considered as candidates for merging. Decisions are made based on the 
geometric mean of the pairwise scores of the connections between the two clusters, if 
this mean exceeds a specific threshold then the cluster is accepted as a candidate, 
and enters a pool of candidates. OthenA^ise it is rejected (denoted in the figure by *X'). 
Phase II) pairwise clustering is applied to identify groups of clusters which are strongly 
connected. At each step the two closest groups are chosen from the pool and merged 
provided that the geometrical mean of all pairwise similarities pass the threshold. 
Otherwise they stay apart (denoted by dashed line). 

The algorithm runs in two phases. First groups ("pools") of clusters that are 
considered as candidates for merging are identified and marked (see Fig. 14). A local 
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test is performed where each candidate cluster is tested with respect to the cluster 
which "dragged" it to the pool, to check their mutual similarity. 

To quantify the similarity of two clusters P and Q, the geometric mean of all painA^ise 
scores of protein pairs one of which from P and the other from Q. are calculated. 
Unrelated pairs are assigned the default (insignificant) e-value of 1. The geometric 
mean reflects the distribution of pairwise connections between the two dusters, such 
that random or unusual connections have only a little effect. When the geometric 
mean of the e-values is below ^/T (more significant) the interpretation is that P and Q 
are indeed related and that their connection does not reflect chance similarities or 
chain of domain-based connections. The level of confidence in the reliability of the 
connection clearly decreases as T increases. The Quality of the P - Q connection is 
defined as minus the log of the geometric mean. This quantity ranges between 0 and 
100, and the higher it is, the more significant the connection. 

At the end of the first phase a group of clusters is left which are candidate for merging. 
This group is the input for the second phase. 

At the second phase a variant of a pairwise clustering algorithm is carried out. This 
algorithm successively merges only pairs of clusters that pass the above test and are 
not suspected as representing chance or domain-based similarities. At each step the 
two closest clusters are chosen (based on the quality of their connection) and merged 
if their similarity (as quantified above, and based on all pairwise similarities between 
the new formed clusters) is more significant than the threshold. The process stops 
and the final clusters are defined (Fig. 14) when the similarity of the next closest 
clusters do not pass the threshold. 

All the rejected merges are marked and registered for further biological analysisTthese 
rejected merges are refered to as possibly related clusters. 

This analysis is performed at different thresholds, or confidence levels, to obtain an 
hierarchical organization. The analysis starts at the 10'''°° threshold. Subsequent runs 
are carried out at levels 10'^^ , 10'®^ , 10'^^ , 10'^° . The process terminates at the 
threshold of 10'° =1. Above the threshold of 1 almost all similarities are in fact chance 
similarities (see previous section). 
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Results 

Nearly all of the clusters found are biologically meaningful, some corresponding to well 
known families, and many others represent less studied families. Some clusters 
exclusively consist of unknown proteins or hypothetical proteins. 
The essence of this work is a fully automatic classification of ail protein sequences, 
without incorporation of any biological considerations other than the use of standard 
painA/ise sequence comparisons. The analysis concerns complete proteins, and is not 
limited only to those subsequences which are identified as functionally or structurally 
important motifs and domains. Indeed, not all the emerging clusters are correlated 
with a specific domain, although some of the clusters encountered are characterized 
by a domain that is common to many or all member proteins. 

General information 

Figure 19 shows a table showing the distribution of cluster sizes at various confidence 
levels. At each level, the set of all proteins splits into clusters, which merge to form 
larger and coarser clusters as the confidence level decreases. In particular, the 
number of isolated proteins (clusters of size 1) diminishes as well. At the lowest 
significance level there are 10.602 clusters, of which 4435 contain at least 2 members, 
1006 clusters have size 10 and above. 

The number of clusters (of size bigger than 1) at each level of confidence ranges 
between 4228 and 5543. How many clusters should there be ? A lower bound for this 
number is provided by the number of different folds, since it is generally expected of 
members of the same cluster to have similar folds. The current estimates place the 
number of folds between several hundreds and one thousand (Wang 1996, Chothia 
1992). However, the same fold is usually adopted by few different superfamilies which 
share little or no sequence similarity. Therefore, in a sequence based analysis it 
should probably be expected that these superfamilies correspond to different clusters, 
possibly related/connected. Moreover, superfamilies may consist of several families, 
sometimes with as little as few percentages of sequence identity. Consequently, these 
families may be classified to different clusters, depending on the method sensitivity. All 
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together the total number of clusters is expected to exceed the number of folds, but it 
is quite safe to predict that by no more than a whole order of magnitude. A number of 
clusters in the 4000-6000 range is, thus, consistent with this estimation. 
Almost each of these thousands of clusters is biologically meaningful. Many of the 
smaller clusters suggest a definition of new biological families, and often a 
characteristic feature is evident. Figure 15 shows only the 50 largest clusters at the 
lowest confidence level (10"^) that is considered. The description attached in the table 
of Fig. 15 to each cluster is based mainly on the SWISSPROT annotations of its 
members. The table should be viewed only as a sample. Henceforth, unless 
otherwise stated, cluster numbers refer to level 10 °, 

As is apparent from Fig. 15, largest clusters are at the lowest confidence level (10'°). 
(i) Clusters are ordered in decreasing order of size (ii) The order of transitivity within 
each cluster is defined as follows: take one of the proteins in the cluster as the 
cluster's seed, The seed's order of transitivity is 0, Its neighbors are of order 1. 
Additional proteins that are neighbors of 1st order proteins, are of order 2, etc. The 
definition does not depend on the choice of the seed protein, (iii) The family 
description states the main feature of the member proteins. 



Performance evaluation 

It is very hard to evaluate the validity of classifications that merge from a large scale 
study of protein sequences. No generally accepted standards have been set yet in 
this field. Thus over the years, new classifications were traditionally compared with 
what is considered (one of) the state of the art characterization of protein sequences, 
namely, the PROSITE dictionary of signature patterns, motifs, and domains (Bairoch 
et al, 1997). For domain based studies, a comparison with the manually derived 
PROSITE dictionary is inevitable, and essential in order to verify that the results are 
biologically meaningful. However, when the analysis is not limited to regions which 
are known or suspected as domains, no standard benchmark exists to assess the 
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quality of the results. One may then resort to domain-based databases for a 
guide-line. Obviously, this may bias the assessment, and should be kept in mind 
when evaluating the results. 

To estimate the quality of the classification it is compared v^ith two databases, 
PROSITE and Pfam (both domain-based databases), using four indices of quality. 

The evaluation methodology 

Given a reference classification A and a new classification B of the same set X, the 
quality of the classification U is evaluated in terms of the reference classification A, by 
means of their mutual agreement . The two classification can be either "hard", i.e. each 
protein is classified to exactly one group, or "soft" where each protein can be classified 
to more then one group. In the present case, Pfam and PROSITE are soft, while the 
present is a hard classification . 

Gracy and Argos (Gracy & Argos 1 998) have proposed a procedure for performance 
evaluation. Each class a E A is associated with the group b E B which maximizes the 
quantity tp-fp-f n. Here true positives (tp) are given by /anb/ - 1, false positives (fp) are 
given by ib \ al and false negatives (fn) are given by la \ bl. Quality is defined by the 
percentage of the true positives 100 - tp/tp+fp+fn. 

In the same way, the percentage of false positives and false negatives are calculated. 
To simplify the comparison between the present results and their results, this index is 
used, denoted Qsingle. 

The procedure resembles the one suggested by Gracy and Argos. The difference is 
that each group a E A is associated with one or more groups b from B, where each 
group b satisfies tp > fp . Specifically, that a group b E 6 is a relative of a group a E A 
if more than 50% of its members are also members of a (see figure Fig. 13). For each 
group a E A all its relatives b in S are identified. The union of all the relatives of a is 
denoted by ba. A protein is misclassified by classification B if it is a member of a 
missed by bg (false negative), or is a member of ba .but not a member of a itself (false 
positive). The conjunction of bg and a defines the group of proteins which are correctly 
classified. 
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Figure 13 schematically shows association of groups in classification B with groups in 
the reference classification A. Groups B I- B 4 are relatives of the ellipse (group of A), 
while groups 85 and B6 are not. 

The quality Qset of the classification for the group a is defined by the percentages of 
the true positives (proteins which were correctly classified) out of the total number of 
proteins in a and ba. i.e, 100. a n bg 

aUba 

This measure accounts for both error types. This procedure is repeated for every 
group a E A, and the total percentage of true positives is given by the average overall 
groups a E A. 

Since dividing a family a into many small clusters (and in the extreme, to singletons 
clusters) is not desirable, another quality index is also defined, that accounts for the 
number of clusters which are relatives of a, and decreases as this number increases. 
This is done by first subtracting the number of relatives of a from the number of true 
positives, then calculating the percentages out of the total number of proteins in a and 
ba, i.e. 1 00. la n ba ' number of relatives of a + 1 

aUba 

This way a family of size N which is mapped to a single cluster of size N has quality of 
1 (or 100%), while a family of size N which correspond to N singletons in the present 
classification has quality 0, This index is denoted by Qset-relatives. 
The fourth index of quality is given by the quantity tp/(tp^fn), which is the portion of the 
reference family a which is in the set ReU. This measure does not take into account 
the false positives. The reason it is used is that not all false positives are indeed false 
positives (as shown in the next sections). This way all false positives are assumed to 
be potential related sequences. Obviously, this is not true. However, it is useful as an 
upper bound on the quality of the classification, assuming that at least some false 
positives are essentially potential related sequences. It is denoted by Qupper-bound 
A consistency test was proposed by Krause and Vingorn (Krause & Vingorn 1998). 
Although their test can be used for rough self-validation, it is less useful for assessing 
the quality of a new classification, with respect to a reference classification. 
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The reference databases 

The present classification was compared with two domain based classifications: 
PROSITE and Pfam. The present classification contains 52,205 proteins of the 
SWISSPROT database, release 33, classified to 10,602 clusters (at the lowest level of 
the analysis), of which 1006 are of size 10 and above (see Fig 19). The PROSITE 
database release 13 (released/associated with the SWISSPROT 33 database) 
contains 24156 proteins, characterized by 1151 different signature patterns. It is very 
common in PROSITE that the same family is characterized by few signature patterns. 
Therefore, all patterns that are documented the same in PROSITE are considered to 
be the same family. The exceptions are those who never appear together in the same 
protein though documented the same. Also considered different (for the purpose/in 
the scope of assessment) are those groups that differ by more than 10 proteins. Also, 
patterns which are always associated with other patterns (i.e. the corresponding 
groups are completely included in bigger groups) are ignored though documented 
differently in PROSITE (e.g. INTEGRIN RETA in EOF, POU-1 and POU-2 in 
HOMEOBOX). Overall, within these terms, the 1151 signatures characterize 874 
protein families and domains, of which 600 are of size 10 and above. The Pfam 
database, release 1,0 (associated with SWISSPROT. 33) contains 15604 proteins, 
classified to 175 families, of which 172 are of size 10 and above. 

Evaluation results 

The results of the evaluation procedure for these reference databases are given in 
table in Figure 20. The evaluation is based on all families with at least 10 members 
(the same analysis with all families with at least 5 members gave the same results up 
to deviation of less then 1%). 

In (Gracy & Argos 1998) a similar procedure with respect to a database which is a 
combination of PROSITE and PIR (George et al, 1996) resulted in 96.6 % true 
positives (1.8% false positives) in PROSITE (which means that the combined 
database is very close to the PROSITE database), 93.2 % true positives in DOMO 
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(0.3% false positives) and 65.1% true positives in ProDom (0.9% false positives). All 
three are domain-based classifications. The first three nunnbers in the table of Fig. 20 
assess the performance of the method of the present invention in the same way as the 
procedure in (Gracy & Argos 1998) and therefore are approximately in the same scale. 
With 69% true positives, the present work compares favorably with ProDom, though 
false positives are more abundant. However, not all of these false positives should be 
counted as false positives (see next section). 

Since the present is a hard classification, the second measure Qset is found to be 
more appropriate. This means that no single cluster is associated with one domain 
family. The different clusters may correspond to different subfamilies, but the overall 
structure is still detectable through connections between clusters (see section 'possibly 
related clusters'). An example is the PROSITE family AA TRNA TIGASE.I family. 
Seven different clusters form the cover of this family corresponding to the subfamilies: 
leucyl/isoleucyl/valyl/methionyl-trna synthetase (cluster 1 52), glutamyi/prolyltrna 
synthetase (cluster 429), tryptophanyl-trna synthetase (cluster 758), arginyl-trna 
synthetase (cluster 988), tyrosyl-trna synthetase (cluster 904), cysteinyl-trna 
synthetase (cluster 1216), and a singleton (cluster 6647) arginyl-trna synthetase (a 
very short fragment). With this measure the quality of performance reaches 76.7%. 

Drawbacks of the evaluation methodology 

The above procedure may result in an over-strict measure, and a few factors that may 
suppress the performance need some careful inspection. One such factor is the 
amount of false positives. In many cases the supposedly false positives are actually 
true positives. For example, short fragments which are undoubtfully identified as 
belonging to a specific family may be considered false positive simply since they are 
too short and do not share the domain which is common to all other members in the 
family. Similarly, hypothetical proteins which share significant sequence similarity with 
a family are not necessarily false positives. Even proteins which are documented as 
belonging to a family, may be counted as false positives simply since they do not have 
exactly the family signature pattern, but rather a slightly modified one. For example, 
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cluster 27 has 139 proteins out of which 132 has the actins and actins like signature. 
Five of the other 7 proteins are documented as actins (and indeed show a remarkable 
similarity with other actins) and two are hypothetical proteins (again, with a strong 
similarity to actins). However, these 7 proteins do not have the actins and actins-like 
signature and therefore are counted, unjustly perhaps, as false positives. Similarly, 
cluster 7 has 46 proteins which do not have the rubisco large signature, and counted 
as false positives, but all are annotated in SWISSPROT as ribulose bisphosphate 
caxboxylase large chain (some are fragments). In the same way cluster 15 has 13 
proteins which do not have the cytochrome p450 signature, 10 of which are variants of 
cytochrome p460, 2 proteins are thromboxane-a synthase, and one is 
transrcinnaniate 4-monooxygenase., all are documented to be part of the cytochrome 
p450 family, and indeed, show a strong similarity with cytochromes p460. 
False positives of these types are very common in the present clusters. Should they 
be counted as false positives ? Obviously, some hits are real false positives, but there 
is no automatic way to detect which false positives are biologically meaningless and 
which are not. It can only be assumed that some false positives can be ignored. 
Consequently, the performance is expected to improve. An upper bound of the 
performance is given by the third index of quality (last column). With Qupper-bound = 
87.4% true positives. The actual quality is expected to be somewhere in between the 
76.7% given by Q,set and the 87.4%. 

Another factor, which may drastically decrease performance, is that some families 
were only characterized for subfamilies while other members in the family, are not 
known to share a well defined domain. In the evaluation procedure they will be 
considered as false positives, and since they may dominate then no cluster will be 
matched with the subfamily. One such example is the ran family which is a subfamily 
of the small G-proteins. The ran proteins are classified to the same cluster as the 
ras/ras-like/rab proteins (cluster 12). However, since the ras/ras-like/rab proteins are 
not characterized by the same signature pattern (nor are they characterized by any 
other signature pattern in release 13,0 of PROSITE), then they are considered as false 
positives. While this may be correct for the ran subfamily, by definition, the real 
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functional relationship with all the other proteins in this cluster may suggest that the 
definition of false positive is somewhat too strict. 

The families which are difficult to resolve in the present analysis (families for which 
performance is worst, e.g. less then 50% true positives) are dominated by short/local 
domains (e.g. PH domain, EGF, ER-TARGET, C1Q, KRINGLE, C2 domain, SH2. 
SH3) or domains that are paired with other, more abundant domains (e.g. opsin paired 
with G-protein-receptor). This is expected, since the present analysis is not a 
domain-based. 

The results of the evaluation procedure with Pfam as the reference database, lead to 
approximately the same conclusions: good performance for protein families, but short 
motifs are not detected well. The quality of the classification increased as the coverage 
(the total portion of the sequences which was included in the multiple alignment used 
to define the domain or the family in the Pfam database) increased, and for coverage > 
0.3 (134 families) the quality raised to Qsingle = 76.9% (6.1% false positives). Qset = 
84.9% (6.2% false positives) and Qupper-bound = 90.9%. For coverage > 0.5 (109 
families) the quality raised to Qsingle, = 80.6% (5% false positives), Qset = 88.3% (5% 
false positives) and Qupper-bound = 93.3%. 

New clusters 

The above evaluation procedure does not take into account the many new clusters in 
the present classification which do not have a counterpart cluster in PROSITE, nor in 
Pfam, Out of the 1006 clusters with more than 10 members each (total of 33682 
proteins), 308 clusters (6989 proteins, which are 20.8%) do not have a counterpart 
family in PROSITE database. 734 clusters (15586 proteins 46.3%) do not have a 
counterpart in Pfam A. 281 clusters are missing from both. 

The largest 20 unannotated clusters (both by PROSITE and Pfam) are listed in the 
table of Fig. 21. They are documented based on their SWISSPROT definition. The 
purity of these clusters (by means of definition consensus) is quite high. Still proteins 
in these clusters are not characterized in PROSITE nor in Pfam. 
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The method of the present invention as an analysis tool 

Tracing the formation of clusters 

A major aspect of the hierarchical organization is that clusters at a given threshold 
may merge at a more permissive threshold. This reflects the existence of subfamilies 
within a family, or families within a superfamily. 

By moving from one level to a more restrictive one, a subdivision of clusters into 
smaller subsets is obtained. These subsets suggest a natural division of the 
corresponding family, as illustrated in the following example for the transport system 
permease proteins. 

Figure 12 is a schematic illustration tracing the formation of clusters. Cluster 170 - the 
transport system permease proteins. In moving on to level 10'^ and further to level 
10"^° the cluster splits into few subclusters. Each circle stands for a cluster at 
threshold 10'^*^. Circles' radii are proportionate to the clusters sizes (numbers indicate 
clusters' sizes). The drawn edges appeared upon changing the threshold from 10'^° to 
the more pennissive 10'®. Edge widths are proportionate to the number of connections 
between the corresponding clusters. 

Cluster 170 at level 10*° with 46 proteins consists of transport system permease 
proteins. These proteins participate in multicomponent transport systems in bacteria. 
Specifically, they are the integral inner-membrane proteins which translocate the 
substrate across the membrane. 

The cluster decomposes into four subclusters at level 10'^^, which form a clique as in 
Fig. 12. These smaller subclusters correspond to the lactose/maltose transport system 
lacG/malG, the lactose/maltose transport system lacF/malF. the phosphate transport 
system, and other transport systems (of sulfate, molybdenum, spermidine and 
putrescine). The subgroups of lacG/malG and lacF/malF form already at level 10'^^. 
Some proteins that combine features from F and G subtypes are denoted in 
SWISSPROT as malGF proteins. However, based on this subclassification and the 
fact that the malG group and the malF group form at such high levels of significance, 
these proteins may be classified either to malG or to malF. 



37 



wo 99/39174 



PCT/IL99/00057 



Hierarchical organization within protein families and superfamilies 
The hierarchical organization also suggests classification within known, families. This 
classification is suggested by scanning the hierarchy over all levels, as illustrated for 
the small G-protein/Ras superfamily. 

The ras gene is a member of a family that have been found in tumor virus genomes 
and are responsible for the viruses' carcinogenic effect. In most cases this viral 
oncogene is closely related to a cellular counterpart, called proto-oncogene. infection 
by a retrovirus that carries a mutant form of the ras gene (ras oncogene), or mutation, 
can cause cell transformation. Indeed, mutations in ras gene are linked to many 
human cancers. 

The cellular ras protein binds guanine nucleotide and exhibits a GTPase activity. It 
participates in the regulation of cellular metabolism, survival and differentiation. In 
the last decade many additional proteins that are related to ras were discovered, all 
of which share the guanine nucleotide binding site. They are referred to as the 
smali-G-protein superfamily (Nuoffer & Baich 1994). This family of proteins has 
several subfamilies: ras, rab, ran, rho, ral, and smaller subfamilies. Like ras, these 
proteins participate in regulatory processes,such as vesicle trafficking (rab),and 
cytoskeleton organization (rho). 

Figure 17 is a schematic representation of the hierarchial organization of the 
small G-protein family. This family is composed of several subfamilies. A total of 229 
proteins, combined in cluster 12 (level 10'°), were grouped together into isolated sets 
at different levels of confidence, to form a natural subclassification of the family. This 
hierarchical organization is much enriched by combining possibly related clusters. 
The related clusters are connected by dashed lines. 

In Fig. 17 the relations within this family are depicted based on the present 
hierarchical organization. A total of 229 proteins, all from the small G-protein 
superfamily, are presented. All were clustered into cluster 12 at the lowest level of 
significance 10"°. Small clusters, which correspond to subfamilies, are formed at 
higher confidence levels, and fuse to larger clusters when the threshold is raised. 
The four main branches coincide with (I) rab subtypes (II) ras, ral and rap (III) rac, rho 
and cdc (cell division control proteins) (IV) ran. Interestingly, the linkage of rac to cdc 
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and rho seems stronger than that between ran and rho or rab and rho. This 
proposed subdivision suggests a connnnon root for all the subtypes, but splits them in 
a way that resembles the evolutionary tree of the small G-protein superfamily 
{Downward 1990). 

As include weaker similarities are included, other families are identified which are 
related to the small G-protein family. According to our map, the clusters which are 
detected as related clusters include cluster 29 (138 proteins), cluster 646 (14 
proteins) cluster 461 (20 proteins) and cluster 1400 (7 proteins). All these clusters 
are possibly -related clusters (rejected merges) of cluster 12 (see next section for a 
discussion on possibly related clusters). Cluster 29 consists of ADPribosylation 
factors family (APF) that are involved in vesicle budding, and of guanine nucleotide 
binding proteins from the sar subfamfly, whose members participate in a different 
type of vesicle budding. Most interestingly - Ga proteins of heterotrimeric G proteins 
belong to the same cluster. 

The connection between this cluster and cluster 12 is based on similarity of the ARFs 
and the rab subfamily, as shown in Fig. 17. Cluster 646 consists of the GTP-binding 
protein ERA and of thiophene/furan oxidation proteins (both being groups of 
GTP-binding proteins). This cluster and cluster 12 are related through the similarity of 
the thiophene/furan oxidation proteins and the ras subfamily. Cluster 461 
(GTP-binding proteins of the OBG family) and cluster 1400 (hypothetical small 
G-proteins) are not directly related to cluster 12. However, these clusters are related 
to clusters 29 and 646, as well as to each other (see Fig. 17). 

Possibly related clusters ar}d local maps 

The clustering algorithm automatically rejects many possible connections among 
clusters. This happens whenever the quality associated with this connection falls 
below a certain threshold (see Fig, 14). Many of these rejected connections are 
nevertheless meaningful and reflect genuine though distant homologies. The 
rejected mergers are referred to as possibly related clusters. 
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In examining a given cluster, much insight can be gained by observing those clusters 
which are possibly related to it. Even though some of these connections are 
justifiably rejected, in particular at the lowest level of confidence (10'°), many others 
do reflect structural/functional similarities, despite a weak sequence similarity. Based 
on the connections with possibly related clusters we can plot local maps (at this 
stage, mostly schematic) for the neighborhoods of protein families. These schematic 
maps can expose interesting relationships between protein families. Here we present 
a map for the immunoglobulins superfamily. 

Two of the big clusters in the tables of Fig. 11 belong to the immunoglobulin 
superfamily. These are cluster 4 (immunoglobulin V region) and cluster 5 
(immunoglobulins and major histocompatibility complex). 

Fig 11A shows clusters possibly related to cluster 4 (Level: 1e-0). Clusters are 
sorted by quality (i.e. minus the log of the geometric mean of similarity scores). Note 
that all clusters belong to the superfamily of immunoglobulins. 
Fig. 1 1 A shows those clusters which are possibly related to cluster 4, ordered by their 
quality value. These clusters include proteins which are involved in aspects of 
recognition at the immune system via the variable regions. 

Fig. 11B shows clusters possibly related to cluster 5 (Level: 1e-0). Only clusters with 
two members or more are shown, and are sorted by quality. Almost all clusters 
belong to the superfamily of immunoglobulins. 
Clusters 596 and 866 are probably unrelated (in italic). 

Likewise, the table of Fig. 118 shows the clusters which are possibly related to cluster 
5. These clusters, unlike the clusters related to cluster 4, consist mostly of proteins 
that adopted the immunoglobulin fold of the Ig constant region. Clusters which are 
suspected to be unrelated appear in italic (one can validate the significance of 
"possibly related clusters" using the quality of their relation and the alignments, and 
insignificant connections can be easily traced and ignored by a manual examination 
of the alignments). 

The two sets of clusters are mostly disjoint, with the exception of cluster 1796. 
Members of this cluster contain both regions, whence the cluster is related to both 
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clusters 4 and 5. The different parts of the proteins account for the appropriate 
relationships. 

Fig. 18 is a schematic map of the immunoglobulin superfamily. Ail clusters related to 
clusters 4 and 5 are shown (these are referred to as directly related clusters), as well 
as other dusters which are indirectly related clusters (clusters that are possibly related 
to directly related clusters). In this schematic map each cluster is represented as a 
circle, whose radius is proportional to the cluster's size. Only clusters' numbers are 
given. See Fig. 22 for more information on these clusters. Clusters on the left contain 
an Ig V-region. The group on the right has an Ig C-region. The clusters in between 
share both regions (except for duster 1468, that consists exclusively of V-region). 
There is also a direct connection between cluster 4 and cluster 5, The connection is 
based on 226 pairwise similarities between cluster 4 and 5. However, all the 
similarities are due to a single protein in cluster 5, a T-cell receptor beta chain 
(PII364). This protein has one V region aside of a C region, and the similarity with the 
226 proteins in cluster 4 is limited to the V region. No other protein in cluster 5 has a 
V region. Note that despite these similarities, clusters 4 and 6 did not merge, and the 
connection was automatically rejected. 

Figure 22 is a table showing clusters belonging to the immunoglobulin superfamily. 
Alt clusters appear in the local map of the immunoglobulins (Fig. 18). 
This information depicts the 'geometry' of the protein space in the vicinity of the 
immunoglobulin superfamily as in Fig. 18. This map includes also indirectly related 
clusters, i.e. possibly related clusters of order 2 and above (related clusters of related 
clusters, etc). The map includes almost all protein families which belong or are 
related to the immunoglobulin superfamily (see Fig, 22 for details), except for three 
clusters: cluster 373 (isolated cluster of periplasmic pilus chaperones). cluster 2172 
"(isolated cluster of THY-1 membrane glycoproteins) and cluster 2363 (B-lymphocyte 
antigen cd19). 

This "local map" of relationships is plotted to distinguish between three main groups. 
The left one corresponds to the variable regions (V domains) of the immunoglobulins, 
the right one to the constant regions (C domains). In the middle appear many 
clusters that are a mixture of the two types. 
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The evolutionary pathway between the structural classes of Ig and Ig-like domains 
are not yet established (Smith & Xue 1997). However, the many alternative 
connections between clusters whose proteins resemble V-domain to these of the 
C-domains indicate that adhesion molecules, fasciclin II and vascular CAM, are 
positioned between the classical V and C-regions. Indeed, these non-lg molecules 
were considered as the 1-set according to their intermediate nature (Haxpaz & 
Chothia 1994). 
Discussion 

The present invention relates to the problem of identifying high order features 
and organization within the protein sequence space. It is an object of the invention to 
exhaustively "chart" proteins, preferably all proteins, and to automatically classify 
them into families. According to the invention, based on pairvvise similarities, 
statistical regularities in the sequence space are sought, and, more concretely, 
clusters of protein sequences. According to the invention, in this high order 
organization most groups consist of sequences with a specific biological function or 
structure. 

A complete charting of the protein space is a daunting task and many 
difficulties are encountered. One preferably begins from well established statistical 
measures, in order to identify significant similarities. Great caution and biological 
expertise are needed to exclude connections which are unacceptable or misleading. 
The main difficulties stem from chance similarities among sequences and 
multi-domain-based connections. However, the sheer volume of data makes it 
necessary to develop automatic methods for such identification. 
The present invention addresses these major problems. According to the invention, 
one starts by creating, for each protein sequence, an exhaustive list of neighboring 
sequences. These lists take into account the scores of the three major methods for 
pairwise sequence comparisons. These three types of scores are jointly normalized 
and the lists are filtered. The link is maintained in the lists only when a significant 
relationship seems to exist. Then a two-phase clustering algorithm is applied to 
identify groups of related sequences. There are two pitfalls to avoid here: on the one 
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hand, it is very easy to follow a very strict rule and generate many small clusters, 
within each of which, the proteins are very closely related. This approach safely 
avoids the creation of false connections. However, it adds very little to the 
understanding of the protein space, since it includes only fairly obvious and well 
known connections. The other extreme strategy is not useful either: a procedure that 
declares a connection without sufficient scrutiny, does not miss interesting 
connections, but generates so many false connections that it becomes impossible to 
recognize significant relations. In other words, such a permissive method quickly 
collapses the whole space into a small number of gigantic meaningless clusters. The 
problem is to find a golden path where non obvious relationships are discovered 
without 'littering' these valuable connections with too many false connections. The 
method of the present invention employs an algorithm that can be described as a 
moderate version of the transitive closure algorithm. At each round of this process 
statistical information is gained on the relationships among current clusters. This 
information is then used to merge certain clusters, thus forming the next round of 
larger, coarser clusters. The algorithm starts from a very conservative classification, 
and is repeatedly applied, at varying levels of confidence, the input for each stage 
being the classification output at the previous stage. 

Finally, a hierarchical organization of ail protein sequences is obtained, strongly 
correlated with a functional partitioning of all proteins. This data structure reveals 
interesting relations between and within protein families, and provides a global view 
("map") of the space of all proteins. 

The classification consists of several thousand clusters, the largest of which contain 
several hundred members each. It is interesting to assess the effect of slowing down 
the transitive closure algorithm. Indeed, a straight forward application of the 
transitive closure (a.k.a, single-linkage clustering) algorithm leads to an avalanche as 
discussed above (details not shown). Already at confidence level 10'^°, most of the 
space is made up of a small number of very large clusters. This avalanche is caused 
by chance similarities and chains of domain-based connections that cause unrelated 
families to merge into few giant clusters. A major ingredient of the new algorithm is 
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the choice of rules for avoiding such undesirable incorrect connections and for 
preventing this collapse, and indeed most of them were eliminated. 
Some domain-based connections did escape the filters, and caused unrelated 
clusters to merge. Also, high scoring low-complexity segments which may be 
biologically meaningless can lead to false connections, and to the formation of 
nonhomogeneous clusters. Though we did take into account the effect of these 
segments, not all of them were filtered out. Since the goal was to detect many 
remote homologies weak filters were used in this case, and many similarities were 
considered at very low levels of confidence. At the moment, domain based 
connections can be easily traced manually, by observing the alignments. 

It is very difficult or even impossible to properly classify all proteins. The 
space of proteins has many different facets, all of which should be considered in 
future, more thorough classification: 3D biological function, biological function, 
domain content, cellular location, tissue specificity, metabolic pathways, etc. Here a 
functional partitioning was sought which is based on sequence analysis. This work 
differs from previous large scale analyses in several ways.- (i) There is no attempt to 
identify protein domains or motifs, (ii) No pre-defined groups or other classification 
are being employed in the analysis of the invention. Moreover, no multiple 
alignments of the proteins are needed, (iii) The space of all protein sequences is 
charted not just particular families, (iv) A global organization of all protein sequences 
is offered. 

The algorithm employed in the method of the present invention has turned out 
well defined groups which are strongly correlated with protein families and 
subfamilies when compared with the well accepted databases PROSITE and Pfam, 
the classification performed well for most of the families, though many of them were 
domain based rather than protein families. Many new clusters in the classification 
didn't have a match from either PROSITE nor Pfam. The hierarchical organization 
can indicate the existence of subfamilies within families, and the concept of possibly 
related clusters exposes distant relationships which reflect functional similarities. 
These relations offer a basis for sketching local maps near protein families. These 
connections can be of biological interest, and should be taken into account in the 
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Study of protein families. For example, an overall view of Fig. 17 or Fig. 18 innplies 
that tracing the hierarchical organization of a cluster and/or its rejected merges can 
provide new information. 

Another aspect of the possibly related clusters is that this list can be viewed as 
a "soft clustering", where the same protein can participate in several different 
clusters, Proteins that are composed of several domains, some of which shared by 
proteins from different families, are naturally associated with more than one group. 
Therefore, though in one embodiment of the invention a multi-domain protein is 
classified to a single cluster, its multi trait feature is tractable if the relations with the 
other clusters are examined as well. 
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What is claim is: 

1. An automatic method of classifying molecules having similar biologic 
function comprising the steps of: a) creating a hierarchical 
organization of said molecules in a database, wherein groups of 
clusters are identified using local consideration resulting in related 
clusters; b) determining the position of a selected molecule based on 
the hierarchical organization of step a, whereby selected molecules 
of similar biologic function are classified. 

2. An automatic method of classifying molecules having similar biologic 
function comprising the steps of: 

a) creating a hierarchical organization of the molecules in a 
database comprising the steps of: 

i calculating pairwise similarities between said 
molecules in said database by combining at least one standard 
measure of similarity, resulting in a first set of expectation values of 
similarity; 

ii analyzing said first set of expectation values of 
similarity so as to obtain a second set of expectation values of 
similarity, wherein said molecules of the second set have a high 
degree of similarity; 
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iii merging said resulting molecules of step ii) so as 
to form clusters, wherein only molecules with an expectation value 
below a first restricted threshold are merged; 

iv identifying groups of clusters using local 
consideration resulting in related clusters; 

v determining the relationship between related 
clusters in said groups; 

vi analyzing said groups of the related clusters of 
step v), thereby creating a hierarchical organization of said molecules; 

b) determining the position of a selected molecule based on 
the hierarchical organization of step a, comprising the steps of: 

i identifying pairwise similarities between said selected 
molecule to the molecules in said database by combining at least one 
standard measure of similarity, resulting in a third set of expectation 
values of similarity; 

it identifying geometric averaging of said selected molecule 
to each of said resulting clusters in said hierarchy, resulting in a forth 
set of expectation values of similarity; 

iii identifying related clusters from said hierarchy of step a, 
having a geometric averaging with said selected molecule below a 
second threshold, thereby classifying molecules having similar 
biologic function. 
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3. The method of claim 2. wherein each related cluster resulting from 
step iv is further analyzed, wherein if said geometric averaging of 
said related cluster to a second related cluster is below said first 
threshold a connection is established between two clusters. 

4. The method of claim 3, wherein said cluster is analyzed by 
calculating the geometric averaging between molecules in one 
duster to molecules in the other cluster. 

5. The method of claim 2, wherein the method of determining the 
relationship between said related clusters in step v further comprises 
the steps of applying a global test on said connected clusters for 
identifying nuclei of strong relationships within said groups of 
clusters, wherein each cluster is checked against its nearest cluster 
by using said geometric averaging, wherein if said geometric 
averaging is below said first threshold the two clusters merged. 

6. The method of claim 2, further comprises repeating steps iv, v while 
raising the threshold in a step wise manner until a third threshold is 
achieved. 

7. The method according to claim 2, wherein said groups containing 
the clusters of every two molecules from different clusters with 
expectation value below the threshold. 

8. The method according to claim 2, wherein the method further 
comprises applying three standard measures of similarity. 

9. The method according to claim 8, wherein said standard measures of 
similarity are Smith Waterman (SW), FASTA, and BLAST. 
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10. The method according to claim 9, further comprising the step of 
applying a numerical normalization to the SW, the FASTA and the 
BLAST, 

11. The method according to claim 10, wherein the method step a) ii) is 
analyzed using thresholds 0.1, 0.1 and 10"^ for SW, FASTA and 
BLAST respectively wherein a value of pairwise similarities between 
two protein sequences is maintained if either SW or FASTA yield an 
expectation value < 0.1 or BLAST'S expectation value is < 10"^. 

12. The method according to claim 11, further comprising filtering the 
results of BLAST thereby excluding low complexity segments, using 
the SEG program; or setting a more stringent threshold for BLAST at 
1 0"^ if filtering decreased the number of related sequences by half. 

13. The method according to claim 12, further comprising maintaining a 
value of an expectation value of similarity between two protein when 
all three methods yield an expectation value < 1 . 

14. The method of claim 1, wherein the molecules are nucleic acids. 

15. The method of claim 1, wherein the nucleic acids are DNA or RNA. 

16. The method of claim 1, wherein the molecules are polynucleotides. 

17. The method of claim 1, wherein the molecules are proteins. 

1 8. The method of claim 1 , wherein the molecules are peptides. 

19. The method of claim 1 , wherein the molecules are glycoproteins. 

20. The method of claim 1, wherein the molecules are complex sugars. 

21. The method of claim 1, wherein the molecules are immunoglobulins. 
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22. The method of claim 1, wherein the molecules are chemicals. 

23. The method of claim 1, wherein said database is the Swissprot 
database. 



24. The method of claim 2, wherein said first threshold is E 



100 



25. The method of claim 2, wherein said second threshold is E^°. 

26. The method of claim 2, wherein said third threshold is. E"^. 

27. A system for identifying biological families of molecules, said system 
comprising : 

a) a storage device containing a first database identifying 
molecules and their corresponding hierarchy; 

b) a processor connected to said storage device and in 
communication with a second database containing molecule 
sequences; 

an input device connected to said processor for inputting a 
selected molecule to be analyzed, wherein said processor is 
programmed to automatically classify molecules having similar 
biologic function. 
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